Multi-node system in which global address generated by processing subsystem includes global to local translation information

ABSTRACT

A system may include a plurality of nodes. Each node may include one or more active devices coupled to one or more memory subsystems. An active device included in one of the nodes includes a memory management unit configured to receive a virtual address generated within that active device and to responsively output a global address identifying a coherency unit. A portion of the global address identifies a translation function. A memory subsystem included in the node is configured to perform the translation function identified by the portion of the global address on an additional portion of the global address in order to obtain a local physical address of the coherency unit. Each active device included in the node is configured to use the portion of the global address identifying the translation function when determining whether a local copy of the coherency unit is currently stored in a cache associated with that active device.

PRIORITY INFORMATION

This application claims priority to U.S. provisional application Ser. No. 60/460,579, entitled “MULTI-NODE SYSTEM IN WHICH GLOBAL ADDRESS GENERATED BY PROCESSING SUBSYSTEM INCLUDES GLOBAL TO LOCAL TRANSLATION INFORMATION”, filed Apr. 4, 2003.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to the field of multiprocessing computer systems and, more particularly, to performing coherent memory replication within multiprocessing computer systems.

2. Description of the Related Art

Multiprocessing computer systems include two or more processors that may be employed to perform computing tasks. A particular computing task may be performed on one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole.

A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system includes multiple processors connected through a cache hierarchy to a shared bus. The bus provides the processors access to a shared memory. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).

Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. An operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches that are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or “snooped”) against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.

Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g., a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to fully supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed available bus bandwidth. Performance may also be adversely affected due to capacitive loading on the shared bus, which increases as more processors are added to the system. Furthermore, as processor performance increases, buses that previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing higher performance processors.

Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes, each of which includes one or more processors and one or more memory devices. The multiple nodes communicate via a network. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.

Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.

Distributed shared memory systems may employ local and global address spaces. The global address space encompasses memory in more than one node. In contrast, local physical address space may only describe memory within a single node. Accesses to the address space within a node (i.e., access to local physical address space) are typically local transactions, which may not involve activity on the network that couples the nodes. Accesses to portions of the address space not assigned to the requesting node are typically global transactions and may involve activity on the network.

In some distributed shared memory systems, data corresponding to addresses of remote nodes may be copied to a requesting node's shared memory such that future accesses to that data may be performed via local transactions rather than global transactions. In such systems, processors local to the node may access the data using the local physical address assigned to the copied data. Remote processors external to that node may use the global address to access the data. Address translation tables are provided to translate between the global address and the local physical address. Improved systems for implementing address translations between global and local physical addresses are desired.

SUMMARY

Various embodiments of systems and methods for performing virtual to global address translation in a processing subsystem within a multi-node computer system are disclosed. In one embodiment, a system may include a plurality of nodes. Each node may include one or more active devices coupled to one or more memory subsystems. An active device included in one of the nodes includes a memory management unit configured to receive a virtual address generated within that active device and to responsively output a global address identifying a coherency unit. A portion of the global address identifies a translation function. A memory subsystem included in the node is configured to perform the translation function identified by the portion of the global address on an additional portion of the global address in order to obtain a local physical address of the coherency unit. Each active device included in the node is configured to use the portion of the global address identifying the translation function when determining whether a local copy of the coherency unit is currently stored in a cache associated with that active device.

BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the present invention can be obtained when the following detailed description is considered in conjunction with the following drawings, in which:

FIG. 1 is a block diagram of a multiprocessing computer system, according to one embodiment.

FIG. 2 shows a node within a multiprocessing computer system, according to one embodiment.

FIG. 3 is a flowchart of a method of performing an intra-node coherency transaction, according to one embodiment.

FIG. 4 illustrates a processing device that includes a virtual-to-global address translation table, according to one embodiment.

FIG. 5A illustrates a memory device that includes a global-to-local physical address translation table, according to one embodiment.

FIG. 5B illustrates a memory device that stores translation information used in other nodes that are replicating a particular coherency unit, according to one embodiment.

FIG. 6A shows an exemplary set of address translations that may be performed in a multi-node system, according to one embodiment.

FIG. 6B shows another exemplary set of address translations that may be performed in a multi-node system, according to one embodiment.

FIG. 7 is a flowchart of a method of performing a coherency transaction involving multiple nodes, according to one embodiment.

FIG. 8 shows an exemplary translation lookaside buffer entry, according to one embodiment.

FIG. 9 is a flowchart of another embodiment of a method of performing a coherency transaction within a node of a multi-node computer system.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF EMBODIMENTS

Multi-Node Computer System

FIG. 1 illustrates a multi-node computer system 100, according to one embodiment. In the embodiment of FIG. 1, multi-node computer system 100 includes three nodes 140A-140C (collectively referred to as nodes 140). Each node includes several client devices. For example, node 140A includes processing subsystems 142AA and 142AB, memory subsystems 144AA and 144AB, I/O subsystem 146A, and interface 148A. The client devices in node 140A share address network 150A and data network 152A. In the illustrated embodiment, nodes 140B and 140C contain similar client devices (identified by reference identifiers ending in “B” and “C” respectively). Note that different nodes may include different numbers of and/or types of client devices, and that some types of client devices may not be included in some nodes.

As used herein, a node is a group of client devices (e.g., processing subsystems 142, memory subsystems 144, and/or I/O subsystems 146) that share the same address and data networks. By linking multiple nodes, the number of client devices in the computer system 100 may be adjusted independently of the size limitations of any individual node 140.

Each node 140 communicates with other nodes in computer system 100 via an interface 148 (interfaces 148A-148C are collectively referred to as interfaces 148). Some nodes may include more than one interface. Interfaces 148 may communicate by sending packets of address and/or data information on inter-node network 154.

Each of processing subsystems 142 and I/O subsystem 146 may access memory subsystems 144. Devices configured to perform accesses to memory subsystems 144 are referred to herein as “active” devices. Because each active device within computer system 100 may access data in memory subsystems 144, potentially caching the data, memory subsystems 144 and active devices such as processing subsystems 142 may implement a coherency protocol in order to maintain data coherency between processing subsystems 142 and memory subsystems 144. Each client in FIG. 1 may be configured to participate in the coherency protocol by sending address messages on address network 150 and data messages on data network 152 using split-transaction packets.

Memory subsystems 144 are configured to store data and instruction code for use by processing subsystems 142 and I/O subsystem 146. Memory subsystems 144 may include dynamic random access memory (DRAM), although other types of memory may be used in some embodiments.

I/O subsystem 146 is illustrative of a peripheral device such as an input-output bridge, a graphics device, a networking device, etc. In some embodiments, I/O subsystem 146 may include a cache memory subsystem similar to those of processing subsystems 142 for caching data associated with addresses mapped within one of memory subsystems 144.

In one embodiment, data network 152 may be a logical point-to-point network. Data network 152 may be implemented as an electrical bus, a circuit-switched network, or a packet-switched network. In embodiments where data network 152 is a packet-switched network, packets may be sent through the data network using techniques such as wormhole, store and forward, or virtual cut-through. In a circuit-switched network, a particular client device may communicate directly with a second client device via a dedicated point-to-point link that may be established through a switched interconnect mechanism. To communicate with a third client device, the particular client device utilizes a different link as established by the switched interconnect than the one used to communicate with the second client device. Messages upon data network 152 are referred to herein as data packets. Note that in some embodiments, address network 150 and data network 152 may be implemented using the same physical interconnect.

Address network 150 accommodates communication between processing subsystems 142, memory subsystems 144, and I/O subsystem 146. Messages upon address network 150 are generally referred to as address packets. In some embodiments, address packets may correspond to requests for an access right (e.g., a readable or writable copy of a cacheable coherency unit) or requests to perform a read or write to a non-cacheable memory location. Address packets may be sent by an active device in order to initiate a coherency transaction. Subsequent address packets may be sent by other devices in order to implement the access right and/or ownership changes needed to satisfy the coherency request. In the computer system 100 shown in FIG. 1, a coherency transaction may include one or more packets upon address network 150 and data network 152. Typical coherency transactions involve one or more address and/or data packets that implement data transfers, ownership transfers, and/or changes in access privileges. If activity within more than one node 140 is needed to complete a coherency transaction, that coherency transaction may also involve one or more packets on inter-node network 154.

When an address packet references a coherency unit, the referenced coherency unit may be specified via an address conveyed within the address packet upon address network 150. As used herein, a coherency unit is a number of contiguous bytes of memory that are treated as a unit for coherency purposes. For example, if one byte within the coherency unit is updated, the entire coherency unit is considered to be updated. In response to an address packet that references a coherency unit, data corresponding to the address packet on the address network 150 may be conveyed upon data network 152. Communications upon address network 150 may be point-to-point or broadcast, depending on the embodiment.

Various active devices such as I/O subsystems 146 and/or processing subsystems 142 may be configured to access data in any node 140 within computer system 100. Several different address spaces may be used to describe the data stored in computer system 100. Virtual addresses, which may be generated within each processing device while executing program instructions, may form one address space. A global address space may include addresses that identify each unique coherency unit stored within any of the nodes in computer system 100, allowing a device in one node to identify data stored in another node. Local physical address space may be unique to each node and contains the physical addresses that are used to access coherency units within the local memory of each node. The local memory of each node includes the memory included in the memory subsystem(s) 144 in that node 140. A memory subsystem 144 is said to “map” a particular global address if the data identified by that global address is stored at a local physical address within that memory subsystem. Various translation functions may map an address specified in one address space to an address within another address space, as described in more detail below.

Active devices within each node 140 may be configured to use global addresses to specify data when sending address packets in coherency transactions. An active device in one node 140A may access data in another node 140B by sending an address packet specifying the data's global address. The memory subsystems 144 may translate a global address received in an address packet to a local physical address and use that local physical address to access the specified coherency unit.

Nodes 140 may perform coherent memory replication so that memory subsystems 144 in different nodes may store copies of the same coherency unit. A replicated coherency unit may be identified by a particular global address, and each memory subsystem 144 that replicates that coherency unit maps that global address to a local physical address. Each replicating node may map the global address to a different local physical address. After performing coherent memory replication, an active device within a replicating node may access the replicated copy of data from a local memory subsystem instead of having to access the data from a memory subsystem in another node. Each node may replicate different portions of the global address space. The coherency protocol may maintain coherency both among the various caches that may store a copy of a particular coherency unit and among the various memory subsystems that may replicate a copy of a particular coherency unit.

A node may be described as being a “mapping” node for a particular coherency unit if a memory subsystem 144 within that node 140 maps the coherency unit. A coherency unit may have multiple mapping nodes. In some embodiments, a single mapping node may be designated as the home node for each coherency unit. The home node for a particular coherency unit may serve as an ordering point for multi-node coherency transactions involving that coherency unit. A node is a non-mapping node with respect to a particular coherency unit if that node does not include any memory subsystem that maps the coherency unit. Global addresses may also be described as being “mapped” global addresses and non-mapped global addresses with respect to a particular node dependent on whether that node is a mapping or non-mapping node for that global address.

FIG. 2 shows a block diagram of a node 140A, according to one embodiment. Note that other embodiments may include different numbers and/or types of devices. As shown, a processing subsystem 142AA may include a memory management unit (MMU) 200. MMU 200 includes logic 202 to perform a virtual address (VA) to global address (GA) translation upon the data addresses generated by the instruction code executed upon the processing core of processing subsystem 142AA, as well as the instruction addresses generated by the processing subsystem 142AA. The addresses generated in response to instruction execution are virtual addresses. In other words, the virtual addresses are the addresses created by the programmer of the instruction code. The virtual addresses are passed through an address translation mechanism 202 (embodied in MMU 200), from which corresponding global addresses are generated. MMU 200 may include a TLB (Translation Lookaside Buffer) in which to cache recently accessed translations.

Virtual to global address translation may be performed for many reasons. For example, the address translation mechanism may be used to grant or deny a particular computing task's access to certain global memory addresses. In this manner, the data and instructions within one computing task are isolated from the data and instructions of another computing task. Additionally, portions of the data and instructions of a computing task may be “paged out” from a memory subsystem 144 to a hard disk drive. When a portion of the data is paged out, the translation(s) corresponding to that data are invalidated. Upon access to the paged-out portion by the computing task, an interrupt occurs due to the invalidated translation. The interrupt allows the operating system to retrieve the corresponding information from the hard disk drive. In this manner, more virtual memory may be available than actual memory described in the global address space. Virtual addresses may also be used for other reasons.

The global address computed by MMU 200 defines a location within the global address space associated with computer system 100. Thus, the global address may identify a mapped coherency unit stored within a local memory (e.g., memory subsystem 144AA) or a non-mapped coherency unit stored within a remote memory included in another node. The global address generated by MMU 200 may be used to determine whether the processing subsystem 142AA currently has a copy of the specified coherency unit cached in a local cache. If any coherency transactions are needed to obtain a particular access right to that coherency unit, the processing subsystem 142AA may communicate an address packet that includes the global address on the address network 150A. Other processing subsystems in that node 140A may use the global address to determine whether their caches are currently storing a copy of the coherency unit specified by that global address. For example, each other processing subsystem in node 140A may use at least a portion of the bits of the global address to access a tag array indicating which global addresses are currently cached by that processing subsystem.
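By way of illustration only, the tag array lookup described above might be sketched as follows. The direct-mapped organization, the 64-byte coherency unit, the 16-set array, and the names `TagArray`, `fill`, and `holds` are all assumptions made for this sketch; the description does not specify a cache organization.

```python
LINE_BITS = 6    # assumed 64-byte coherency unit (not specified in the description)
INDEX_BITS = 4   # assumed 16-set, direct-mapped tag array


class TagArray:
    """Hypothetical tag array indexed by a portion of the global address."""

    def __init__(self):
        self.tags = [None] * (1 << INDEX_BITS)

    def _split(self, global_address):
        # Drop the offset within the coherency unit, then split index and tag.
        line = global_address >> LINE_BITS
        index = line & ((1 << INDEX_BITS) - 1)
        tag = line >> INDEX_BITS
        return index, tag

    def fill(self, global_address):
        index, tag = self._split(global_address)
        self.tags[index] = tag

    def holds(self, global_address):
        # True if this device currently caches the addressed coherency unit.
        index, tag = self._split(global_address)
        return self.tags[index] == tag
```

A device that has filled the unit at global address 0x12340 would report a hit for any address within that unit and a miss for a different unit that happens to share the same index.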

In some embodiments, memory subsystem 144AA may be coupled to processing subsystem 142AA by address network 150A, as shown in FIG. 2. In such embodiments, processing subsystem 142AA may request access to a coherency unit stored in memory subsystem 144AA by sending an address packet containing the global address generated by MMU 200 on address network 150A. In alternative embodiments, a memory controller may be integrated with processing subsystem 142AA (e.g., both the memory controller and the processing subsystem may be integrated in a single integrated circuit). In these alternative embodiments, the global address generated by MMU 200 may be provided directly to the integrated memory controller within processing subsystem 142AA without being transmitted in an address packet on address network 150A.

A memory controller included within memory subsystem 144AA may include logic 204 configured to translate the global address generated by the processing subsystem's MMU 200 into a local physical address (LPA). Whenever a global address is received (either from address network 150A or directly from a processing subsystem 142AA with which the memory controller is integrated), the memory controller may input the global address into the GA to LPA logic 204 in order to obtain the corresponding LPA. Some global addresses may not be mapped by memory subsystem 144AA, and these unmapped global addresses may not be translated by GA to LPA logic 204. Memory subsystem 144AA may effectively ignore address packets specifying these unmapped global addresses.

In addition to generating global addresses from virtual addresses, MMU 200 may also generate a set of one or more translation bits for each global address. The translation bits may identify one of various different translation functions that may be used to map a global address into the local physical address space of memory subsystem 144AA (or any other memory subsystem within node 140A). In one embodiment, local physical addresses for which the memory subsystem 144AA is the home memory subsystem may be the same as the global addresses generated by MMU 200. The translation bits generated by MMU 200 for such a global address may indicate that no translation function should be performed on the global address to obtain the local physical address. In other embodiments, a relatively straightforward transformation may relate global addresses to local physical addresses for which memory subsystem 144AA is the home memory subsystem. For example, in one embodiment, each memory subsystem may remove a portion of the global address or replace a portion of the global address with one or more local address bits. Note that the translation bits corresponding to a particular global address may vary from node to node.

If a particular global address is neither local to nor replicated within the node, the translation bits may indicate that no translation should be performed since there is no local physical address for that global address. A memory subsystem 144 may use these translation bits to determine whether to input a global address to translation logic 204. Similarly, an interface 148A may use these translation bits to determine whether to forward a coherency request to another node. For example, an interface 148A may be configured to always forward coherency requests that specify non-local, non-replicated global addresses, as indicated by the value of the translation bits included in the address packet, to one or more other nodes 140 via inter-node network 154.
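As a minimal sketch of the decisions just described, assume that one translation-bit encoding is reserved to mean "not mapped in this node". The value `UNMAPPED`, the `AddressPacket` type, and both function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

UNMAPPED = 0b1111  # assumed reserved encoding; the description does not fix a value


@dataclass(frozen=True)
class AddressPacket:
    """Hypothetical address packet carrying a global address and its translation bits."""
    global_address: int
    translation_bits: int


def memory_should_translate(packet):
    # A memory subsystem feeds the address to its GA to LPA logic only if mapped.
    return packet.translation_bits != UNMAPPED


def interface_should_forward(packet):
    # An interface forwards requests for non-local, non-replicated addresses.
    return packet.translation_bits == UNMAPPED
```

Under this assumption, the memory subsystem and the interface reach complementary decisions for any given packet without consulting a separate lookup structure.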

In some embodiments, unmapped global addresses may have the same translation function bits as global addresses for which the node is the home node. The memory subsystem 144AA (and/or the interface 148A) may be configured to differentiate unmapped global addresses from mapped global addresses based on the global address range in which each global address is included. For example, the memory subsystem 144AA may track which portions of the global address space are currently mapped to that memory subsystem and use the tracked information to differentiate mapped and unmapped global addresses. In other embodiments, different translation bits may be used to distinguish mapped addresses from unmapped addresses.
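The range-based differentiation described above could, for example, be sketched as a sorted list of mapped ranges searched by bisection. The class `MappedRanges`, its half-open `[start, end)` convention, and the assumption that ranges never overlap are illustrative choices, not details taken from the description.

```python
import bisect


class MappedRanges:
    """Hypothetical tracker of global address ranges mapped by one memory subsystem."""

    def __init__(self):
        self._starts = []
        self._ends = []

    def add(self, start, end):
        # Record the half-open range [start, end); ranges are assumed not to overlap.
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._ends.insert(i, end)

    def is_mapped(self, global_address):
        # Find the last range starting at or below the address, then check its end.
        i = bisect.bisect_right(self._starts, global_address) - 1
        return i >= 0 and global_address < self._ends[i]
```

A memory subsystem using such a tracker could ignore any address packet for which `is_mapped` returns false, even when the packet's translation bits match those used for home addresses.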

If a particular global address has been replicated within memory subsystem 144AA and memory subsystem 144AA is not the home node, one of various different translation functions may have been used to map that global address to a local physical address within the memory subsystem 144AA. The translation bits generated by MMU 200 for that global address may identify the particular translation. The GA to LPA logic 204 included in memory subsystem 144AA may use these translation bits to select the appropriate translation function to apply to the global address in order to obtain the local physical address.

FIG. 3 illustrates one embodiment of a method of operating a multi-node computer system. At 601, a processing subsystem accesses the processing subsystem's translation lookaside buffer to translate a virtual address to a global address. Accessing the translation lookaside buffer may retrieve a global address and one or more bits identifying a translation function associated with the virtual address. The processing subsystem may encode both the global address and the bits identifying the translation function into an address packet and forward the address packet on the address network (e.g., in order to initiate a coherency transaction for that coherency unit). A memory subsystem that maps the global address may use the bits identifying the translation function to select which translation to apply to the global address in order to obtain the local physical address of the coherency unit within that memory subsystem, as shown at 603. The memory subsystem may then use the local physical address to access the specified coherency unit in memory. Other processing subsystems may use the global address to detect whether they are currently caching a copy of the coherency unit specified by the global address, as shown at 605.
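The flow of FIG. 3 can be sketched end to end under simplifying assumptions. In this illustrative model the TLB is a dictionary keyed by whole virtual addresses, each cache is a set of global addresses, and the two entries in `FUNCTIONS` are hypothetical translation functions; none of these names or encodings comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressPacket:
    global_address: int
    translation_bits: int


# Hypothetical translation functions, selected by the translation bits.
FUNCTIONS = {
    0: lambda ga: ga,            # no translation: LPA equals GA
    1: lambda ga: ga & 0xFFFF,   # assumed example: keep only the low 16 bits
}


class ProcessingSubsystem:
    def __init__(self, tlb):
        self.tlb = tlb           # virtual address -> (global address, translation bits)
        self.cached = set()      # global addresses currently cached

    def issue(self, virtual_address):
        # Step 601: the TLB lookup yields both the GA and the translation bits.
        ga, bits = self.tlb[virtual_address]
        return AddressPacket(ga, bits)

    def snoop(self, packet):
        # Step 605: other devices check their caches using the global address.
        return packet.global_address in self.cached


class MemorySubsystem:
    def __init__(self):
        self.cells = {}          # local physical address -> data

    def handle(self, packet):
        # Step 603: the bits select the function that produces the LPA.
        lpa = FUNCTIONS[packet.translation_bits](packet.global_address)
        return lpa, self.cells.get(lpa)
```

In this model the requesting device never computes a local physical address itself; it only forwards the translation bits that its TLB supplied.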

FIG. 4 illustrates an exemplary MMU 200, according to one embodiment. Here, the MMU 200 includes a translation lookaside buffer (TLB) 202 used to translate virtual addresses generated within a processor 142 into global addresses. The TLB may include several entries, each of which may include a global address 212 and a translation 222. Note that in many embodiments, only a portion (e.g., the base address of a page) of the global address 212 may be actually stored in TLB 202. The TLB 202 may use a portion of the virtual address to select the appropriate TLB entry and combine the portion of the global address 212 stored in that entry with a portion (e.g., a page offset) of the virtual address to generate the total global address 212. Additionally, the TLB entry may output a translation 222 corresponding to that global address. The translation 222 may be a set of one or more bits identifying a translation function (e.g., a hashing function or other manipulation) to be applied to the global address to generate the local physical address (LPA) within that node. If there is no corresponding LPA within that node, the translation bits 222 may indicate that the global address is not mapped to the node and/or that no translation should be applied to that global address. The memory subsystem 144 may in turn be configured to detect whether it maps such an address by comparing the global address to one or more ranges of mapped global addresses and/or by identifying the global address as an unmapped address in response to the value of the translation bits 222.
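A TLB lookup of the kind described for FIG. 4 might look like the following sketch. The 13-bit page offset, the dictionary keyed by virtual page number, and the names `TlbEntry` and `tlb_lookup` are assumptions made for illustration; the description does not specify a page size or a TLB organization.

```python
from dataclasses import dataclass

PAGE_BITS = 13                      # assumed 8 KiB pages (not specified in the description)
PAGE_MASK = (1 << PAGE_BITS) - 1


@dataclass(frozen=True)
class TlbEntry:
    ga_page: int       # stored portion of global address 212 (page base)
    translation: int   # translation bits 222


def tlb_lookup(tlb, virtual_address):
    """Return (global address, translation bits), or None on a TLB miss."""
    entry = tlb.get(virtual_address >> PAGE_BITS)   # select entry by virtual page
    if entry is None:
        return None
    # Combine the stored page base with the page offset of the virtual address.
    global_address = (entry.ga_page << PAGE_BITS) | (virtual_address & PAGE_MASK)
    return global_address, entry.translation
```

On a miss, a real MMU would typically consult the page tables; that path is omitted here.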

Translation information may be cached in a TLB entry in the TLB 202 in response to the translation information being used to translate a virtual address. The information may be more permanently stored in page tables within memory (e.g., included in memory subsystem 144AA). The page tables may be created by an operating system executing on one or more of the processing subsystems 142. The instructions implementing the operating system may themselves also be stored in a memory subsystem 144. Note that the same page table structure and TLB structure may be used to map both mapped and unmapped addresses.

For replicated global addresses, the operating system may select which translation function to use to map that global address into the local physical address space dependent on which portions of local physical address space are currently available to be mapped to replicated addresses. The portion(s) of local physical address space available to map replicated addresses may be effectively handled as an associative cache into which replicated addresses may be mapped in some embodiments. The available range into which certain global addresses may be mapped may be limited. For example, certain translation functions may not uniquely map the entire range of non-home global addresses to unique LPAs; the operating system may not use such a translation function to translate any of the non-home global addresses that translation function is not capable of mapping to a unique LPA. The decision as to whether to replicate or not replicate a particular non-home global address may be made on a per-node basis (e.g., based on one or more criteria such as current access patterns, user-selected constraints, performance impact, etc.).
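One way an operating system might make the selection described above is to try each available translation function in turn and keep the first one whose resulting LPA page is free, much as a set-associative cache searches its ways. The function `choose_translation`, the page-granular bookkeeping, and the first-fit policy are assumptions for this sketch only.

```python
def choose_translation(ga_page, functions, free_lpa_pages):
    """Pick translation bits for a replicated global page, or None if no slot is free.

    functions maps translation bits to a callable from global page to LPA page;
    free_lpa_pages is the set of LPA pages currently available for replication.
    """
    for bits, fn in sorted(functions.items()):
        lpa_page = fn(ga_page)
        if lpa_page in free_lpa_pages:
            return bits, lpa_page
    return None   # every candidate slot is occupied; the page is not replicated
```

When no function yields a free page, the caller could decline to replicate the page or first evict an existing replica; the description leaves that policy to the per-node criteria it mentions.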

In some embodiments, the set of available translation functions may allow any LPA (other than those allocated to home addresses) to be mapped to any GA. In other embodiments, a more limited set of translation functions may be available (e.g., in order to limit the number of translation bits 222 required to uniquely identify one of the translation functions), which may in turn restrict the set of LPAs to which a particular GA may be mapped. For example, in one embodiment, sixteen or fewer translation functions may be available, allowing any translation function to be identified using four translation bits 222.
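The relationship between the number of available translation functions and the number of translation bits 222 can be checked with a short calculation. The helper name below is hypothetical, and the sketch assumes that each function is given its own binary code with no codes reserved for other purposes.

```python
def translation_bits_needed(num_functions: int) -> int:
    """Smallest number of bits that can uniquely identify num_functions functions."""
    if num_functions < 1:
        raise ValueError("at least one translation function is required")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2; use one bit as a floor.
    return max(1, (num_functions - 1).bit_length())


bits_for_sixteen = translation_bits_needed(16)
```

If an embodiment also reserves a code to mean "not mapped in this node", that code would count against the available values.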

FIG. 5A illustrates the logic 204 included in a memory subsystem 144AA that may be used to translate a global address to a local physical address. The GA to LPA translation logic 204 may receive the global address and the translation bits 222. Depending on which translation function, if any, is identified by the translation bits 222, the translation logic 204 selectively uses that translation function to modify at least a portion of the global address to generate the local physical address 190. For example, as with the TLB translation, only a portion of each global address may be translated to generate the local physical address. The un-translated portion (e.g., a page offset) may then be concatenated with the translated portion to generate the local physical address. Note that in other embodiments, however, the entire global address may be translated to generate the local physical address. After the translation function is applied to the global address, the resulting local physical address may be used to perform an access to a memory device included in the memory subsystem 144AA.
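A minimal sketch of the GA to LPA translation logic 204 follows. The two translation functions, the page size, the size of local memory, and the convention that code 0 means "unmapped" are assumptions chosen for illustration; the disclosure does not specify particular functions.

```python
PAGE_BITS = 13                      # assumed page-offset width
PAGE_MASK = (1 << PAGE_BITS) - 1
LOCAL_PAGES = 1 << 10               # assumed number of pages in local memory


def low_bits(page):
    """Hypothetical function 1: keep the low-order page bits."""
    return page % LOCAL_PAGES


def xor_fold(page):
    """Hypothetical function 2: fold upper page bits into the lower ones."""
    return (page ^ (page >> 10)) % LOCAL_PAGES


# Translation bits 222 -> translation function; 0 is assumed to mean "unmapped".
TRANSLATION_FUNCTIONS = {1: low_bits, 2: xor_fold}


def ga_to_lpa(global_address, translation_bits):
    """Translate the page portion and concatenate the untranslated page offset."""
    function = TRANSLATION_FUNCTIONS.get(translation_bits)
    if function is None:
        return None  # global address is not mapped by this memory subsystem
    page = global_address >> PAGE_BITS
    offset = global_address & PAGE_MASK
    return (function(page) << PAGE_BITS) | offset


ga = (0x403 << PAGE_BITS) | 0x55
```

Here the same global address yields a different local physical address depending on which function the translation bits select, which is what allows different nodes to place a replicated coherency unit at different local locations.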

Due to the ability of active devices to access data in multiple nodes, there is a possibility that a coherency unit may be cached in any node. The coherency protocol may support coherency transactions involving more than one node. In order to communicate effectively with active devices in other nodes, each active device may use global addresses to specify coherency units. However, since each node may use a different translation to map a global address to local address space, the translation 222 associated with each global address in each node that is replicating that global address may also be needed in order to access the coherency unit in the mapping memory subsystem in each replicating node.

Translation information 222 for at least some of the nodes 140 thatreplicate a coherency unit may be stored by the home memory subsystemfor that coherency unit. For example, the home memory subsystem for acoherency unit may store information identifying which nodes currentlyreplicate that coherency unit and which translation function eachreplicating node uses to map that coherency unit's global address intothe replicating node's local address space, as shown in FIG. 5B. FIG. 5Bshows exemplary information 240 memory subsystem 144AA may store for acoherency unit whose home memory subsystem is memory subsystem 144AA.The memory subsystem 144AA may store the information 240 in memory(e.g., RAM) or in a separate cache or metadata storage. In someembodiments, the information may be stored in a table indexed by all orpart of the global address 212 of each coherency unit for which thatnode is the home node. Alternatively, the information may be indexed orstored according to local physical address. For each home coherencyunit, the memory subsystem 144AA may store information identifying thetranslations 222 used in each node that replicates that coherency unit.FIG. 5B shows an exemplary entry 242 in a translation information table240 for a coherency unit that is replicated in nodes 140B and 140C. Theentry 242 includes a translation 222B for node 140B and a translation222C for node 140C.
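The translation information table 240 of FIG. 5B may be modeled as a mapping from each home global address to the translations used by its replicating nodes. The class name, method names, and the use of string node identifiers below are assumptions for illustration only.

```python
class HomeTranslationTable:
    """Hypothetical model of table 240 kept by a home memory subsystem."""

    def __init__(self):
        self._entries = {}  # global address -> {node id: translation bits 222}

    def record_replica(self, global_address, node_id, translation_bits):
        self._entries.setdefault(global_address, {})[node_id] = translation_bits

    def drop_replica(self, global_address, node_id):
        nodes = self._entries.get(global_address, {})
        nodes.pop(node_id, None)
        if not nodes:
            self._entries.pop(global_address, None)

    def replicating_nodes(self, global_address):
        return sorted(self._entries.get(global_address, {}))

    def translation_for(self, global_address, node_id):
        """Return the node-specific translation bits, or None if not replicated there."""
        return self._entries.get(global_address, {}).get(node_id)


# Entry 242: a coherency unit replicated in nodes 140B and 140C.
table = HomeTranslationTable()
table.record_replica(0x4000, "140B", 0b0010)
table.record_replica(0x4000, "140C", 0b0111)
```

Indexing by global address is only one of the options mentioned above; an implementation indexed by local physical address would use a different key but could store the same per-node values.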

In some embodiments, if a coherency transaction involves multiple nodes,a packet indicating the coherency transaction may be sent to the homenode for the specified coherency unit. The interface 148 in the homenode may then provide a packet indicating the coherency transaction tothe coherency unit's home memory subsystem. If replicating nodes otherthan the home node and the initiating node need to participate in thecoherency transaction, the home memory subsystem may send the interface148 in the home node the translation information 222 associated witheach other replicating node whose participation is required in thecoherency transaction. The interface 148 may provide these replicatingnodes with a packet indicating the desired coherency activity to beperformed in each node. The node-specific translation information 222associated with each node may also be included in that packet, allowingthe specified coherency unit to be accessed in the local physicaladdress space of each replicating node.

In embodiments in which the home memory subsystem maintains thetranslation information for each replicating node, the interfaces 148may not need to maintain this translation information. Interfaces 148may also avoid performing any translations on addresses whencommunicating with other nodes. Accordingly, in these embodiments,interfaces 148A and 148B may not need to store translation bits 222 forcoherency units replicated in the interfaces' respective nodes (or,alternatively, for coherency units whose home nodes are the interfaces'respective nodes). In such embodiments, the global address 212 andtranslation bits 222 output by an interface 148 are the same as theglobal address 212 and translation bits 222 received by that interface148. In other words, no address translation may be performed within theinterface 148, either before sending packets on the local addressnetwork or before sending packets on the inter-node network 154. In someembodiments, if other metadata (e.g., directory information) is alreadybeing looked up for the coherency unit in the home memory subsystem,retrieving the translation information from the home memory subsystemmay not add significant latency to the coherency transaction. Note thatalternative embodiments may instead store node-specific global to localphysical address translation information in interfaces 148 instead of inmemory subsystems 144. For example, instead of storing translationinformation for each replicating node in the home memory subsystem, thetranslation information for each replicating node may be stored in thatnode's interface(s) 148. When packets involved in coherency transactionsare received from other nodes via inter-node network 154, the interface148 in a replicating node may append the node-specific translationinformation for the specified coherency unit in the replicating node andprovide that information on the node's address network.

FIG. 6A illustrates how virtual, global, and local physical addresses may be generated and transmitted within various parts of computer system 100. Two exemplary nodes 140A and 140B are illustrated in FIG. 6A. One node 140A is the home node for a particular coherency unit. Accordingly, a memory subsystem 144AA included in node 140A maps that coherency unit. In other words, a copy of that coherency unit is stored at a local physical address within the local physical address space corresponding to memory subsystem 144AA. Additionally, a memory subsystem 144AB in another node 140B replicates that coherency unit. Note that while memory subsystems 144AA and 144AB may each map the coherency unit to a different local physical address, the same global address 212 is used to identify the coherency unit within both nodes 140A and 140B.

When the coherency unit is replicated in node 140B, the home memory subsystem 144AA for that coherency unit may store information indicating that the coherency unit is replicated in node 140B. The home memory subsystem 144AA may also store the translation bits 222B identifying which translation is used to map the coherency unit's global address 212 to a local physical address in memory subsystem 144AB. This information may be generated and stored by the operating system that decides to replicate the coherency unit during the replication process.

In this example, a processing subsystem 142AA initiates a coherencytransaction to request an access right for the coherency unit that isreplicated in node 140B. As part of a coherency transaction, thecoherency unit may need to be obtained from node 140B and the variousdevices' access rights to that coherency unit may need to be modified.For example, a device in node 140B may have write access to thecoherency unit, and processing subsystem 142AA may need to obtain a copyof that coherency unit from the device with write access in order tohave the most up-to-date copy of that coherency unit. Processingsubsystem 142AA may also need to modify the other device's access right(e.g., to a shared access right, if processing subsystem 142AA isrequesting shared access) as part of the coherency transaction.

When processing subsystem 142AA initiates a coherency transaction togain an access right to the coherency unit, processing subsystem 142AAmay output the global address 212 and translation 222A associated withthat coherency unit in an address packet. A different translationfunction may be used to map the global address identifying thatcoherency unit into memory subsystems 144AA and 144AB, and thus thetranslation 222A may differ from a translation 222B associated withglobal address 212 in node 140B. If the address packet is broadcast toall of the client devices within the node, any other processingsubsystems 142 within node 140A may each use the global address todetermine whether that processing subsystem has a cached copy of thecoherency unit. Memory subsystem 144AA may receive the address packetvia the address network of node 140A and determine whether anyinter-node coherency activity is required to complete the coherencytransaction. If, for example, memory subsystem 144AA earlier received anindication that a device in node 140B requested write access to thecoherency unit, memory subsystem 144AA may determine that node 140B mayhave a more recently updated version of the coherency unit that shouldbe provided to processing subsystem 142AA as part of the coherencytransaction. Accordingly, memory subsystem 144AA may determine that thecoherency transaction may not be completed without the involvement ofnode 140B.

In response to determining that node 140B's involvement is needed to complete the coherency transaction, memory subsystem 144AA may provide the translation 222B used to map that global address 212 to node 140B's local physical address space to interface 148A for communication to node 140B. Interface 148A may responsively communicate a packet to node 140B via the inter-node network indicating the coherency transaction, the global address 212, and the translation 222B. In some embodiments, the home memory subsystem 144AA may cause interface 148A to send a packet to node 140B by forwarding the address packet sent by processing subsystem 142AA to interface 148A upon determining that the coherency transaction cannot be completed within node 140A. Before sending the address packet to interface 148A, memory subsystem 144AA may replace the translation bits 222A with 222B or append translation bits 222B to the packet generated by processing subsystem 142AA. Alternatively, interface 148A may send a packet requesting the appropriate translation bits 222B for global address 212 to memory subsystem 144AA in response to receiving the address packet sent by processing subsystem 142AA. In response to memory subsystem 144AA returning the translation 222B for that global address 212, the interface 148A may send a packet on the inter-node network to node 140B containing the global address 212, the translation 222B, and an indication of the coherency transaction.

Interface 148B may include the global address 212, translation 222B, and indication of the requested coherency activity received from node 140A in an address packet sent on the address network within node 140B. If this packet is broadcast, processing subsystems such as processing subsystem 142AB may use the global address 212 to determine whether a copy of the coherency unit is stored in that processing subsystem's cache. When memory subsystem 144AB receives the packet, memory subsystem 144AB may identify the client device in node 140B that should respond to the address packet. For example, if memory subsystem 144AB has ownership of the specified coherency unit, the memory subsystem 144AB may respond by sending a copy of the requested coherency unit and/or by modifying an access right or responsibility associated with that coherency unit. If the memory subsystem 144AB is responding by sending a copy of the specified coherency unit, as shown in FIG. 6A, memory subsystem 144AB may use the translation bits 222B (provided by memory subsystem 144AA) to select which translation function to apply to global address 212 in order to obtain the local physical address of the coherency unit. Upon obtaining the local physical address, memory subsystem 144AB may access the coherency unit in memory and return a copy of the coherency unit to the requesting node 140A via interface 148B.

FIG. 6B illustrates a similar coherency transaction initiated in a node140C that is neither the home node 140A nor a replicating node 140B forthe requested coherency unit. Here, a processing subsystem 142AC in node140C initiates a coherency transaction for a coherency unit byoutputting the coherency unit's global address 212 on node 140C'saddress network. Since the coherency unit is not replicated within node140C, the translation bits 222C generated by processing subsystem 142ACmay indicate that the coherency unit is not replicated and/or that notranslation is needed. Assuming that no active device in node 140C canperform the necessary data transfers, access right transitions, and/orownership transitions to complete the coherency transaction, interface148C may forward a packet indicating the global address 212 and therequested coherency transaction to the home node 140A.

Once the request is forwarded to the home node 140A, the coherency transaction may proceed similarly to that shown in FIG. 6A. If the home node can satisfy the coherency transaction, the interface 148A in the home node 140A may return a copy of the requested coherency unit to the initiating node 140C. If the home node 140A cannot satisfy the coherency transaction, the home node may forward the request to another node that can satisfy the coherency transaction. In this example, the home node 140A forwards the request to a node 140B that is replicating the specified coherency unit. The home node includes the global address 212 and the translation bits 222B associated with node 140B in the coherency request forwarded to node 140B. As in FIG. 6A, the memory subsystem 144AB may use the translation bits 222B to obtain the local physical address of the specified coherency unit. The memory subsystem 144AB may then return a copy of the specified coherency unit to the requesting node 140C via interface 148B.

FIG. 7 shows another flowchart of a method of operating a multi-nodecomputer system, according to one embodiment. At 701, a processingsubsystem accesses its TLB to translate a virtual address to a globaladdress and to obtain local translation bits associated with that globaladdress. The processing subsystem forwards both the global address andthe local translation bits in an address packet on a local addressnetwork. Other processing subsystems in the same node as that processingsubsystem may use the global address to determine whether the specifiedcoherency unit is locally cached by those processing subsystems. Thememory subsystem in the home node that maps the coherency unitidentified by the global address generated at 701 may determine thatanother node is mapping the specified coherency unit. If the othernode's participation in the coherency transaction is needed, the homememory subsystem may retrieve remote translation bits associated withthat global address at the other node, as shown at 703. Note that eachdifferent node may associate a different set of translation bits withthe same global address. In other words, each node may use a differenttranslation function to map the same global address to a different localphysical address.

The home memory subsystem may provide the global address and the remote translation bits to an interface to the other node. As shown at 705, the interface receives the global address and remote translation bits and forwards both to an interface in the other node. A memory subsystem in the other node uses the remote translation bits to select which translation function to apply to the global address, as indicated at 707. By applying the selected translation to the global address, the remote memory subsystem generates the appropriate local physical address within the local physical address space of that node.
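The flow of FIG. 7 (701 through 707) can be traced with a small model in which the home node swaps in the remote node's translation bits and the remote memory subsystem applies the function they select. The packet fields, the request name, and the two remote translation functions are hypothetical; only the ordering of the steps follows the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AddressPacket:
    global_address: int
    translation_bits: int
    request: str


# Hypothetical translation functions available in the remote node.
REMOTE_FUNCTIONS = {
    1: lambda ga: ga & 0xFFFF,
    2: lambda ga: (ga ^ 0x8000) & 0xFFFF,
}


def home_forward(packet, remote_bits_by_address):
    """703/705: look up the remote node's bits and forward them with the global address."""
    remote_bits = remote_bits_by_address[packet.global_address]
    return replace(packet, translation_bits=remote_bits)


def remote_translate(packet, functions):
    """707: select and apply the translation function named by the remote bits."""
    return functions[packet.translation_bits](packet.global_address)


# 701: the initiating device emits the global address with its own local bits.
local_packet = AddressPacket(global_address=0x12345, translation_bits=1,
                             request="read_to_share")
forwarded = home_forward(local_packet, {0x12345: 2})
remote_lpa = remote_translate(forwarded, REMOTE_FUNCTIONS)
```

Note that the global address is unchanged between nodes in this sketch; only the translation bits differ, so the interfaces themselves perform no address translation.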

In some embodiments, the translation information may be part of theglobal addresses generated by active devices, as opposed to beinghandled as a separate piece of address information as described above.FIG. 8 illustrates an exemplary processing subsystem 142 that may beincluded in such an embodiment. Here, the processing subsystem 142includes a TLB 202. Each TLB 202 entry may include a global address 212.A portion 222 of the bits in the global address may indicate thetranslation function associated with that global address. Since eachnode may use a different translation function to map a given globaladdress into local physical address space, the portion 222 of the globaladdress 212 that identifies the translation function may vary amongnodes. Active devices may use some of the global address bits that areused to specify local translations for certain global addresses todetermine whether a copy of the specified coherency unit is locallycached.

In some embodiments, certain global addresses may be replicable while others may not. One bit of the global address (e.g., the highest order bit) may indicate whether that global address is replicable. If the address is not replicable, the portion 222 of the global address 212 that would otherwise be used to store translation bits may instead be used as normal address bits. Accordingly, the range of addressable global address space allocated to non-replicable global addresses may be larger than the range of addressable global address space allocated to replicable global addresses.

By using a portion 222 of the global address itself to specify the translation function, a portion of the global address space may effectively go unused. For example, in one embodiment, the value of the highest order bit in a 47-bit global address 212 may indicate whether an address is replicable or not. If an address is replicable, the next three highest order bits may be used to specify the translation function used within that node for that global address. If the address is not replicable, the next three highest order bits may instead be used to specify the address. The local memory in the node may use the same translation function to handle all non-replicable addresses mapped to that memory, so translation information may not be necessary for these addresses. Similarly, if a non-replicable address does not map to any memory within the node, no translation information is necessary since the coherency unit will need to be retrieved from its home node. In such an embodiment, the use of four (out of 47) address bits to specify translation information (three bits to indicate the translation function, one bit to indicate whether replicable) for replicable global addresses may reduce the effective global address space by 7/16ths.
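The 47-bit layout in this example, and the 7/16 figure, can be worked through as follows. The constant and function names are assumptions for illustration; the bit positions follow the example above (bit 46 as the replicable flag, bits 45 through 43 as the translation function for replicable addresses).

```python
from fractions import Fraction

GA_BITS = 47
REPLICABLE_BIT = GA_BITS - 1                 # highest order bit (bit 46)
XLATE_BITS = 3
XLATE_SHIFT = REPLICABLE_BIT - XLATE_BITS    # translation field occupies bits 45..43


def decode(global_address):
    """Return (replicable, translation_bits or None, remaining address bits)."""
    replicable = bool((global_address >> REPLICABLE_BIT) & 1)
    if not replicable:
        # The three bits serve as ordinary address bits: 46 address bits in total.
        return False, None, global_address & ((1 << REPLICABLE_BIT) - 1)
    translation = (global_address >> XLATE_SHIFT) & ((1 << XLATE_BITS) - 1)
    return True, translation, global_address & ((1 << XLATE_SHIFT) - 1)


total_encodings = 1 << GA_BITS               # 2**47 possible address values
non_replicable = 1 << REPLICABLE_BIT         # 2**46 distinct locations
replicable = 1 << XLATE_SHIFT                # 2**43 distinct locations
lost_fraction = Fraction(total_encodings - non_replicable - replicable,
                         total_encodings)
```

The replicable half of the encoding space (2**46 values) names only 2**43 distinct locations, because eight encodings differing only in the translation field refer to the same coherency unit. The difference, 7 x 2**43 out of 2**47, is the 7/16 reduction stated above.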

In embodiments in which local translation identifiers are treated aspart of global addresses, the home memory subsystem for a givenreplicable coherency unit may store the portion 222 of the globaladdress used to identify that coherency unit in each node that iscurrently replicating that coherency unit. If multiple nodes arereplicating a given coherency unit, the home memory subsystem may trackmultiple different translation portions 222 of the coherency unit'sglobal address. The home memory subsystem may substitute the appropriateremote translation bits into the global address or otherwise provide thetranslation bits to the interface in the home node for transmission aspart of the global address in a packet sent to the remote node.Accordingly, the interface may not need to store this translationinformation for each replicating node and coherency transactions may beimplemented similarly to the examples of FIGS. 6A-6B. In turn, theinterface may simply forward the addressing information it receiveswithout needing to perform any addressing translations whencommunicating between nodes.
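Substituting a remote node's translation bits into the global address may be sketched as a simple mask-and-insert operation. The field position assumes the 47-bit example layout (three translation bits at bits 45 through 43); the function name is hypothetical.

```python
XLATE_SHIFT = 43                       # assumed position of translation portion 222
XLATE_WIDTH = 3
XLATE_MASK = ((1 << XLATE_WIDTH) - 1) << XLATE_SHIFT


def substitute_translation(global_address, remote_bits):
    """Replace the translation portion of a global address with remote_bits."""
    if not 0 <= remote_bits < (1 << XLATE_WIDTH):
        raise ValueError("translation bits out of range")
    return (global_address & ~XLATE_MASK) | (remote_bits << XLATE_SHIFT)


# Home node uses function 2 for this coherency unit; the replicating node uses 5.
home_view = (1 << 46) | (2 << XLATE_SHIFT) | 0xABC
remote_view = substitute_translation(home_view, 5)
```

All bits outside the translation portion are preserved, so both views identify the same coherency unit while selecting different translation functions in their respective nodes.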

FIG. 9 is a flowchart of a method of performing address translations in a multi-node system, according to one embodiment. At 901, a processing subsystem translates a virtual address to a global address, which includes one or more translation bits identifying a translation function, and forwards the global address on an address bus. The address translation may be performed by accessing a TLB. Each processing subsystem in the multi-node system may be configured to perform similar address translations. In some embodiments, certain global addresses may not include translation bits. For example, certain global addresses may not be replicable in more than one node. One bit of global address information may identify whether that global address is replicable or not. If a global address is not replicable, the memory subsystem in the home node for that global address may be configured to either not perform any address translation or to perform the same address translation on all home global addresses to obtain the local physical address. Accordingly, it may be unnecessary to specify any translation function in such a global address and the bits that would otherwise be used to specify a translation function may instead be used to specify addresses within an otherwise non-addressable portion of the global address range. If a global address is replicable, a portion of the global address may be used to specify which translation function should be used to translate that global address to a local physical address in a particular node. Note that different nodes may use different translation functions to translate the same global address, and thus the translation function portion of the global address may differ from node to node.

At 903, a memory subsystem uses the translation bits to select which translation to apply to the global address to generate a local physical address. The local physical address locates the specified data within that memory subsystem. The processing subsystem that generates the global address at 901 may provide the global address to the memory subsystem directly (e.g., if the processing subsystem and memory controller are both implemented in a single integrated circuit) or via the node's address network.

The processing subsystem that performs the address translation at 901may output an address packet containing the global address on its node'saddress network in order to initiate a coherency transaction. Otherprocessing subsystems in the same node may receive the address packetfrom the address network and use the global address, including at leastsome of the bits that are used to specify the translation function, todetermine whether a copy of the coherency unit identified by that globaladdress is locally cached. Thus, unlike the implementation describedwith respect to FIG. 3 in which the bits used to specify the translationmay not be used when looking up a global address in a cache, at leastsome of the translation function bits may be used when determiningwhether the global address hits or misses in a local cache.

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

1. A system, comprising: a plurality of nodes, wherein each nodeincludes an active device and a memory subsystem coupled to the activedevice; wherein an active device included in a node of the plurality ofnodes includes a memory management unit configured to receive a virtualaddress generated within that active device and to responsively output aglobal address identifying a coherency unit, wherein a portion of theglobal address identifies a translation function; wherein a memorysubsystem included in the node is configured to perform the translationfunction identified by the portion of the global address on anadditional portion of the global address in order to obtain a localphysical address of the coherency unit; wherein each active deviceincluded in the node is configured to use the portion of the globaladdress identifying the translation function when determining whether alocal copy of the coherency unit is currently stored in a cacheassociated with that active device; wherein a home memory subsystemincluded in a home node of the plurality of nodes for the coherency unitis configured to store the portion of the global address identifying thetranslation function for the node, wherein active devices included inthe home node are configured to generate a different value of theportion of the global address, wherein the different value identifies adifferent translation function associated with the coherency unit in thehome node; and wherein if the home memory subsystem determines that thecoherency transaction involving the coherency unit cannot be completedwithin the home node, the home memory subsystem is configured to providethe portion of the global address identifying the translation functionfor an additional node to a home interface included in the home node forconveyance to the additional node.
 2. The system of claim 1, wherein atleast one bit included in the global address indicates whether thecoherency unit identified by the global address is replicable in morethan one of the plurality of nodes.
 3. The system of claim 2, wherein ifthe at least one bit included in a different global address indicatesthat the different global address is not replicable in more than one ofthe plurality of nodes, the portion of the different global addressincludes additional address bits instead of identifying a translationfunction.
 4. The system of claim 1, wherein the additional portion ofthe global address for the coherency unit generated by each activedevice in the plurality of nodes has a same value, and wherein activedevices in different nodes of the plurality of nodes generate differentvalues of the portion of the global address identifying the translationfunction.
 5. The system of claim 1, wherein the active device included in the node is configured to output the global address in an address packet on an address network coupling the active device to an additional active device within the node in order to initiate a coherency transaction for a coherency unit identified by the global address.
 6. The system of claim 1, wherein a memory controller included in the memory subsystem is integrated in a same integrated circuit as the active device.
 7. A method for use in a system comprising a plurality ofnodes, wherein each node includes an active device and a memorysubsystem coupled to the active device, the method comprising: an activedevice included in a node of the plurality of nodes translating avirtual address generated within the active device to a global addressidentifying a coherency unit, wherein a portion of the global addressidentifies a translation function; a memory subsystem included in thenode performing the translation function identified by the portion ofthe global address on an additional portion of the global address inorder to obtain a local physical address of the coherency unit; anadditional active device included in the node using the portion of theglobal address identifying the translation function when determiningwhether a local copy of the data is currently stored in a cacheassociated with the additional active device; wherein at least one bitincluded in the global address indicates whether the coherency unitidentified by the global address is replicable in more than one of theplurality of nodes; and the active device translating a differentvirtual address to a different global address, wherein the at least onebit included in the different global address indicates that thedifferent global address is not replicable in more than one of theplurality of nodes, and wherein the portion of the different globaladdress includes additional address bits instead of identifying atranslation function.
 8. The method of claim 7, further comprising active devices in different ones of the plurality of nodes generating a same value of the additional portion of the global address for the coherency unit, and active devices in different ones of the plurality of nodes generating different values of the portion of the global address identifying the translation function.
 9. The method of claim 7, furthercomprising a home memory subsystem included in a home node of theplurality of nodes for the coherency unit storing the portion of theglobal address identifying the translation function for the node,wherein active devices included in the home node generate a differentvalue of the portion of the global address, wherein the different valueidentifies a different translation function associated with thecoherency unit in the home node.
 10. The method of claim 9, wherein if the home memory subsystem determines that the coherency transaction involving the coherency unit cannot be completed within the home node, the home memory subsystem provides the portion of the global address identifying the translation function for the node to a home interface included in the home node for conveyance to the node.
 11. The method ofclaim 7, further comprising the active device included in the nodeoutputting the global address in an address packet on an address networkcoupling the active device to an additional active device within thenode in order to initiate a coherency transaction for a coherency unitidentified by the global address.
 12. The method of claim 7, furthercomprising an operating system executing on the active device creating atranslation lookaside buffer entry corresponding to the virtual address,wherein the translation lookaside buffer entry includes the globaladdress, wherein the operating system selects the translation functionin order to map the virtual address to the local physical address withina non-replicated range of local physical addresses of the memorysubsystem.
 13. The method of claim 12, further comprising the operatingsystem executing on the active device in one of the nodes creating thetranslation lookaside buffer entry corresponding to the virtual addressin response to deciding to replicate the coherency unit to the node froman additional one of the plurality of nodes.
 14. A method for use in asystem comprising a plurality of nodes, wherein each node includes anactive device and a memory subsystem coupled to the active device, themethod comprising: an active device included in a node of the pluralityof nodes translating a virtual address generated within the activedevice to a global address identifying a coherency unit, wherein aportion of the global address identifies a translation function; amemory subsystem included in the node performing the translationfunction identified by the portion of the global address on anadditional portion of the global address in order to obtain a localphysical address of the coherency unit; an additional active deviceincluded in the node using the portion of the global address identifyingthe translation function when determining whether a local copy of thedata is currently stored in a cache associated with the additionalactive device; and an operating system executing on the active devicecreating a translation lookaside buffer entry corresponding to thevirtual address, wherein the translation lookaside buffer entry includesthe global address, wherein the operating system selects the translationfunction in order to map the virtual address to the local physicaladdress within a non-replicated range of local physical addresses of thememory subsystem.
 15. The method of claim 14, further comprising theoperating system executing on the active device in one of the nodescreating the translation lookaside buffer entry corresponding to thevirtual address in response to deciding to replicate the coherency unitto the node from an additional one of the plurality of nodes.